NSF PAR Search | NSF Public Access Repository

Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery

Adamu, Mustapha; Zhang, Qi; Pan, Huitong; Latecki, Longin; Dragut, Eduard (September 2025, the MANILA workshop series at SIGIR)

Free, publicly-accessible full text available September 9, 2026

Taxonomy-Driven Knowledge Graph Construction for Domain-Specific Scientific Applications

https://doi.org/10.18653/v1/2025.findings-acl.223

Pan, Huitong; Zhang, Qi; Adamu, Mustapha; Dragut, Eduard; Latecki, Longin Jan (January 2025, Association for Computational Linguistics)

Full Text Available

ClimateIE: A Dataset for Climate Science Information Extraction

https://doi.org/10.18653/v1/2025.climatenlp-1.6

Pan, Huitong; Adamu, Mustapha; Zhang, Qi; Dragut, Eduard; Latecki, Longin Jan (January 2025, Association for Computational Linguistics)

Full Text Available

DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition

https://doi.org/10.18653/v1/2025.findings-naacl.137

Zhang, Qi; Pan, Huitong; Chen, Zhijia; Latecki, Longin Jan; Caragea, Cornelia; Dragut, Eduard (January 2025, Association for Computational Linguistics)

Full Text Available

FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding

Pan, Huitong; Zhang, Qi; Caragea, Cornelia; Dragut, Eduard; Latecki, Longin J (August 2024, European Conference on Artificial Intelligence (ECAI))

Flowcharts are graphical tools for representing complex concepts in concise visual representations. This paper introduces the FlowLearn dataset, a resource tailored to enhance the understanding of flowcharts. FlowLearn contains complex scientific flowcharts and simulated flowcharts. The scientific subset contains 3,858 flowcharts sourced from scientific literature and the simulated subset contains 10,000 flowcharts created using a customizable script. The dataset is enriched with annotations for visual components, OCR, Mermaid code representation, and VQA question-answer pairs. Despite the proven capabilities of Large Vision-Language Models (LVLMs) in various visual understanding tasks, their effectiveness in decoding flowcharts, a crucial element of scientific communication, has yet to be thoroughly investigated. The FlowLearn test set is crafted to assess the performance of LVLMs in flowchart comprehension. Our study thoroughly evaluates state-of-the-art LVLMs, identifying existing limitations and establishing a foundation for future enhancements in this relatively underexplored domain. For instance, in tasks involving simulated flowcharts, GPT-4V achieved the highest accuracy (58%) in counting the number of nodes, while Claude recorded the highest accuracy (83%) in OCR tasks. Notably, no single model excels in all tasks within the FlowLearn framework, highlighting significant opportunities for further development.
Full Text Available

SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions

Pan, Huitong; Zhang, Qi; Caragea, Cornelia; Dragut, Eduard; Latecki, Longin (May 2024, COLING)

We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus’s scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus’s utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.
Full Text Available

SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents

https://doi.org/10.18653/v1/2024.emnlp-main.726

Zhang, Qi; Chen, Zhijia; Pan, Huitong; Caragea, Cornelia; Latecki, Longin Jan; Dragut, Eduard (January 2024, Association for Computational Linguistics)

Full Text Available

DMDD: A Large-Scale Dataset for Dataset Mentions Detection

https://doi.org/10.1162/tacl_a_00592

Pan, Huitong; Zhang, Qi; Dragut, Eduard; Caragea, Cornelia; Latecki, Longin Jan (September 2023, Transactions of the Association for Computational Linguistics)

The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.
Full Text Available

SGUNet: Semantic Guided UNet for Thyroid Nodule Segmentation

Pan, Huitong; Zhou, Quan; Latecki, Longin Jan (April 2021, IEEE Int. Symposium on Biomedical Imaging (ISBI))
Full Text Available
